CIND820 Capstone Project:

Implementing Machine Learning Price Prediction with the Ames Housing Dataset

Setting Up Environment

Importing necessary libraries and modules

Setting up display options for pandas

1. Basic Data Understanding & Data Cleaning

Loading and inspecting the dataset

Checking for Duplicate Values

Setting the target variable

Basic Summary Statistics for Target

Creating indexes for discrete, continuous, ordinal, and nominal features while displaying the number of each type of feature.

1. Basic Summary Statistics for Discrete Features

Handling discrete NaNs

In the case of categories like Basement Full Bath and Garage Cars, we would expect that a NaN value indicates the absence of that particular feature. As such, we fill the missing values with 0.
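This fill is a one-liner with pandas. A minimal sketch on toy data (the column names follow the Ames dataset's conventions; the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame; column names follow the Ames dataset, values are made up
df = pd.DataFrame({
    "Bsmt Full Bath": [1, np.nan, 2],
    "Garage Cars": [2, 1, np.nan],
})

# A NaN here means the house lacks the feature, so 0 is the natural fill
discrete_cols = ["Bsmt Full Bath", "Garage Cars"]
df[discrete_cols] = df[discrete_cols].fillna(0)
```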

Basic Summary Statistics for Continuous Features

Handling Continuous NaNs

Similar to the discrete variables, NaN values for all of the continuous features here likely indicate the absence of that particular feature. For example, NaN values for Total Bsmt Square Footage or Masonry Veneer indicate that the house does not have a basement or veneer. Again, I fill these NaN values with 0.

There is something curious, though, about the large number of NaN values for Lot Frontage. This seemed strange to me, so I checked a random sample of records with NaN values for Lot Frontage against the assessor records that are available via the City of Ames website.

After checking the assessor records (overhead photos), all of the houses in my sample clearly have lot frontage on a street. Some have irregular shapes, such as crescent lots, but most of these still have recorded frontage figures. In other respects, the assessor records seem to match those in the dataset. This appears to be an error in data collection.

To resolve this in the present context, I fill the NaN values for Lot Frontage with the median value.
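The median fill can be sketched as follows (toy values; in this small example the median works out to 70.0):

```python
import numpy as np
import pandas as pd

# Toy Lot Frontage column with the kind of gaps described above
df = pd.DataFrame({"Lot Frontage": [60.0, 80.0, np.nan, 70.0, np.nan]})

# Median imputation for the erroneous NaNs
df["Lot Frontage"] = df["Lot Frontage"].fillna(df["Lot Frontage"].median())
```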

Examining Value Counts for Ordinal Features

I create a quick function that displays all the value counts for a group of features.
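A sketch of such a helper (the function name is my own; value_counts(dropna=False) keeps the NaNs visible in the counts):

```python
import pandas as pd

def show_value_counts(df, cols):
    """Print the value counts, including NaNs, for each feature in cols."""
    for col in cols:
        print(f"--- {col} ---")
        print(df[col].value_counts(dropna=False))
        print()

# Toy data with an Ames-style ordinal feature
df = pd.DataFrame({"Bsmt Qual": ["Gd", "TA", None, "Gd"]})
show_value_counts(df, ["Bsmt Qual"])
```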

Handling Ordinal NaNs

For most of the ordinal features, a NaN value indicates the absence of that feature, so I fill the missing values with 'None'.

There is one exception. One single record has a NaN value for Electrical. It is very unlikely that this means that the house has no electrical system. It is necessary then to check the record in question.

Given that this particular house was built in 2006, it is almost certain that it would have a standard breaker box. Just to check, though, I will take a quick look at Electrical system type by Year Built.

As I thought, the most recent house built without a standard breaker was built in 1965. It is safe then to assume that our mystery house has a standard breaker.
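The check can be sketched with a groupby; the records here are fabricated to mirror the situation described:

```python
import pandas as pd

# Fabricated records: find the latest Year Built for each Electrical type
df = pd.DataFrame({
    "Electrical": ["SBrkr", "FuseA", "SBrkr", "FuseF", None],
    "Year Built": [2006, 1950, 1999, 1965, 2006],
})

latest_by_system = df.groupby("Electrical")["Year Built"].max()

# The lone NaN belongs to a 2006 house, so a standard breaker is a safe fill
df["Electrical"] = df["Electrical"].fillna("SBrkr")
```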

Examining Value Counts for Nominal Features

Handling Nominal NaNs

I fill NaN nominal values with 'None' as it is likely that these represent an absence of a feature such as a garage or alley access.

Again I came across a discrepancy. There are several houses that have 'None' for Mas Vnr Type yet there is a non-zero value for Mas Vnr Area.

Since there are only 5 records, I check the Ames City Assessor records again, examine photos of each of the houses, and then fill in the appropriate Mas Vnr Type or alternatively set the Mas Vnr Area to 0.

There was one other discrepancy that I came across: one record has a Garage Type value and a non-zero Garage Area value but 0 for Garage Yr Blt.

To fix this I replaced the zero value for Garage Year Built with the Year Built value for the house (1910).
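This fix can be sketched with a boolean mask (a single toy record reproducing the discrepancy):

```python
import pandas as pd

# Toy record reproducing the discrepancy described above
df = pd.DataFrame({
    "Garage Type": ["Detchd"],
    "Garage Area": [240],
    "Garage Yr Blt": [0],
    "Year Built": [1910],
})

# Replace the spurious 0 with the house's Year Built
mask = (df["Garage Yr Blt"] == 0) & (df["Garage Area"] > 0)
df.loc[mask, "Garage Yr Blt"] = df.loc[mask, "Year Built"]
```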

Dropping ID (Order) and Property Identifier (PID) from the dataframe

Re-coding any ordinal features that were not already numeric levels
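Re-coding an ordinal feature amounts to mapping its levels to integers. A sketch using a hypothetical quality scale of the kind shared by several Ames features:

```python
import pandas as pd

# Hypothetical ordinal scale shared by several quality features
quality_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

df = pd.DataFrame({"Kitchen Qual": ["Gd", "TA", "Ex"]})
df["Kitchen Qual"] = df["Kitchen Qual"].map(quality_map)
```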

Re-coding categorical features that had been coded with numbers

Simplifying Multilevel Categorical Features

2. Exploratory Data Analysis

Scatter plots Showing Relationship Between Features of Expected Importance and Sale Price

We can see here that there is a positive linear relationship between Above Grade Living Area and Sale Price.

There are, however, some extreme outliers - specifically, 3 houses with above-grade square footage over 4,500.

There is also a linear relationship between Lot Area and Sale Price.

Again, there are a number of outliers, especially the 4 values with lot area greater than 100,000.

There is, for the most part, a linear relationship between Total Basement Square Footage and Sale Price. It does not appear to be quite as tight, and there are a lot of 0 values. However, there don't seem to be any extreme outliers.

This plot between Garage Area and Sale Price seems quite similar in shape to the scatter plot for Total Basement Square Footage and Sale Price. It seems to be a fairly linear relationship with quite a few zero values and no extreme outliers.

In the last of the scatter plots, I wanted to see what would result if Above Grade Living Area and Total Basement Square Footage were combined. The result is interesting, as it shows a very strong linear relationship between Total Square Footage and Sale Price.

Removing Outliers

I remove the 3 outliers for Above Ground Living Area and the 4 outliers for Lot Area.

Added Total Square Footage Feature

I combined Above Ground Living Area with Total Basement Square Footage in a new feature - Total Indoor Square Footage.
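Both steps can be sketched together (toy rows; the cutoffs are the ones noted above, and the second and third rows are built to trip them):

```python
import pandas as pd

# Toy rows; the second and third trip the outlier cutoffs noted above
df = pd.DataFrame({
    "Gr Liv Area": [1500, 4700, 2000],
    "Lot Area": [9000, 10000, 120000],
    "Total Bsmt SF": [800, 2000, 1000],
})

# Drop the extreme outliers seen in the scatter plots
df = df[(df["Gr Liv Area"] <= 4500) & (df["Lot Area"] <= 100000)].copy()

# New feature: above-ground plus basement square footage
df["Total Indoor SF"] = df["Gr Liv Area"] + df["Total Bsmt SF"]
```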

Boxplots Showing Relationship Between Categorical Variables of Interest and Sale Price

House Style

First we take a look at the relationship between House Style and Sale Price. We can see that it is possible to purchase nearly any style of home for a price in a band of about $140,000 to $200,000 in Ames. The price band for one-story and two-story homes varies considerably, starting under $50,000 and going up to the mid-$300,000s. Interestingly, while the 2.5 Finished and 2.5 Unfinished price bands are higher than that for 2Story homes, 2Story homes have a significantly higher ceiling. The remaining house styles have relatively tighter pricing bands.

Neighborhood

Next we look at Neighborhood. Immediately it is apparent that Stone Brook, Northridge Heights, and Northridge are the priciest neighborhoods, with median home prices over $300,000. Greenhills is only slightly cheaper. Otherwise, homes can be purchased in most neighborhoods within our price band of $150,000 to $200,000. Aside from the upper-class areas, only the upper-middle-class neighborhoods of Somerset, Timber, Veenker, and College Creek have median prices above the $200,000 mark. Cheaper neighborhoods are Briardale, Meadow Village, Brookside, Old Town, Edwards, and South West of Iowa State University. It makes sense that Iowa DOT and Railroad is the cheapest place to buy a house in Ames.

Zoning

We can see that while homes in Residential Low-Density zones have a wider price band, they generally tend to have higher prices than those in Residential High-Density and Residential Medium-Density zones. The Floating Village zoning has the highest prices. (Note: Floating Village here refers to a type of planned community with various amenities built into the neighborhood.)

Not surprisingly, homes that are in Commercial, Industrial or Agricultural zones tend to be cheaper.

Here we can see the majority of homes sold in Ames between 2006 and 2010 were built since the 1960s with a considerable number built around the turn of the millennium.

This set of boxplots gives us a bit more detail in terms of the price of a home depending on year built. As is perhaps obvious, newer homes tend to be more expensive. It is interesting though that houses built in 1892 seem to fetch higher prices than we might expect.

In this chart we can see that house prices actually tended to decline in Ames over the 2006-2010 period, with a fall in prices after 2007 that corresponds to the beginning of the US housing finance crisis.

This period is significantly different from the long-term housing price trend in Iowa, as we can see in the graph provided by the Federal Reserve.

Frequency Distributions and Histograms

Looking at the frequency distribution of Sale Price we can see that our target variable has a significant right-skew.

I attempted to perform a log transformation on both the target variable (Sale Price) and significantly skewed independent variables. However, this resulted in both warnings and exceptions elsewhere in the code. This might be because log(x) approaches negative infinity as x approaches zero. (see http://onbiostatistics.blogspot.com/2012/05/logx1-data-transformation.html)

As an alternative, I utilized a log(x+1) transformation. (see https://www.kaggle.com/apapiu/house-prices-advanced-regression-techniques/regularized-linear-models)

This shifts the distribution to the right resulting in greater normality.
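The transform can be sketched with numpy, which provides log1p for log(x+1) and expm1 for the inverse (the prices below are toy values, not drawn from the dataset):

```python
import numpy as np
import pandas as pd

# Toy sale prices
prices = pd.Series([129500.0, 181500.0, 223500.0])

# np.log1p computes log(x + 1) accurately, even for values near 0
log_prices = np.log1p(prices)

# np.expm1 inverts the transform when predictions must go back to dollars
restored = np.expm1(log_prices)
```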

Looking at our histograms of discrete and continuous variables, we can see that most of the continuous features are right-skewed. It would be good, then, to transform the most skewed of these variables using log(x+1), as we will do with our target variable.

There are a lot of discrete features (along with Mas Vnr Area) that are dominated by 0 values. This makes sense, as in most cases homes do not have a pool or an enclosed porch. Nevertheless, it is doubtful that these zero-dominated features will be important when developing our predictive models.

I thought I would check a couple of individual histograms to see if any more detail emerges when we zoom in. We can see a bit more detail in these two features' distributions, but really the large number of zero values predominates.

Correlation Matrices

We begin by looking at an extremely large correlation matrix which includes all features in our data set.

While it is a bit chaotic, a few things stand out. First, there are some features which have a high level of collinearity. Most are fairly easy to understand:

Positive:

  1. Pool Area & Pool Quality
  2. Garage Quality & Garage Condition
  3. Garage Area & Garage Cars
  4. Year Built & Year Garage Built
  5. Total Rooms Above Ground & Total Area Above Ground
  6. Basement Finish Type Square Footage & Basement Type Square Footage

Negative:

  1. Basement Unfinished Square Footage & Basement Full Bathrooms
  2. Basement Unfinished Square Footage & Basement Finish Type Square Footage
  3. Year Built & Overall Condition

A bit more interesting is the close relationship between Total Basement Square Footage and First Floor Square Footage.

As well, there is a negative relationship between Enclosed Porch and Year Built.

Matrix of 22 Features Most Correlated with Target

Next we zoom in to look specifically at the 22 features with the highest level of correlation with our target variable - Sale Price.

We see that there are very strong correlations between Total Indoor Square Footage, Overall Quality, and AboveGround Living Area Square Footage.

Numeric Feature Correlation Matrix

Here we can again see problems of multicollinearity.
Particularly problematic are:

Positive:

  1. Garage Area and Garage Cars
  2. Total Basement Square Footage and 1st Floor Square Footage
  3. Total Rooms Above Ground and Above Ground Living Area

Negative:

  1. Basement Unfinished Square Footage and Basement Finished Square Footage
  2. Basement Unfinished Square Footage and Basement Full Bathrooms

We also see significant collinearity between the new Total Indoor Square Footage feature and Above Ground Living Area Square Footage, Total Basement Square Footage, and 1st Floor Square Footage.

Categorical Feature Correlation Matrix Using Kendall Rank Method

Our correlation matrix using the Kendall Rank correlation for categorical features suggests that there is not a lot of monotonicity between categorical features.

The only variables that show a degree of monotonicity are Kitchen Quality with External Quality and Garage Condition with Garage Quality.

Selecting Basic Features

Based on the correlation matrices above, I have selected 16 variables for a preliminary multiple linear regression model, taking those most strongly correlated with Sale Price.

Basic Features:

  1. Overall Quality
  2. AboveGrade Living Area Square Footage
  3. External Quality
  4. Total Basement Area Square Footage
  5. Garage Area
  6. Basement Quality
  7. Year Built
  8. Garage Finish
  9. Full Bathrooms
  10. Year Remodel/Addition
  11. Fireplace Quality
  12. Masonry Veneer Area
  13. Heating Quality
  14. Basement Finish 1 Square Footage
  15. Lot Frontage
  16. Lot Area

3. Data Preparation II

I separate our dataset into object and nonObject features and do a quick check to make sure there are no NaN values. I then one-hot encode the object features using pd.get_dummies.

Next I take a brief look at some of the most skewed of the nonObject features before performing a log(x+1) transformation on those with an absolute skew greater than 0.6.

Finally, I perform a log(price+1) transformation on the target variable.
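The whole step can be sketched on toy data (column names follow the Ames conventions; the 0.6 skew cutoff is the one used above):

```python
import numpy as np
import pandas as pd

# Toy data; column names follow Ames conventions, values are made up
df = pd.DataFrame({
    "Lot Area": [9000, 10500, 85000, 11000],
    "Year Built": [1960, 1975, 2005, 1999],
    "Neighborhood": ["NAmes", "OldTown", "NridgHt", "NAmes"],
})

# One-hot encode the object features
obj_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=obj_cols)

# log(x + 1) transform for numeric features with absolute skew above 0.6
num_cols = ["Lot Area", "Year Built"]
skewed = [c for c in num_cols if abs(df[c].skew()) > 0.6]
df[skewed] = np.log1p(df[skewed])
```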

Creating Test and Train Sets

We concatenate our object and nonObject features and then create train and test sets.
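A minimal sketch of the split on toy arrays, using scikit-learn's train_test_split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and target
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```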

Standardization

Next we standardize our data to prevent large discrepancies in scale between features from causing distortions.
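A sketch using scikit-learn's StandardScaler; note that the scaler is fit on the training data only and then reused on the test data, so no information leaks from the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

# Fit the scaler on the training data only, then apply it to the test data
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
```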

4. Feature Selection

In this section I try two different methods of feature selection using SelectFromModel from the sklearn.feature_selection module.

Random Forest Selector

By default, SelectFromModel keeps the features whose importance - for a random forest regressor, the mean decrease in squared error across its splits - is higher than the average importance of all features.
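A minimal sketch of this selector on synthetic data, where only the first feature actually drives the target (50 trees here just to keep the example fast; the project used far more):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only the first of 5 features actually drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

# With no explicit threshold, features above the mean importance are kept
selector = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=0))
selector.fit(X, y)
print(selector.get_support())
```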

Hyperparameter tuning

Through trial and error I found that progressively lowering max features (at each split), max depth of trees, and max samples (the number of samples drawn from X to train each base estimator) works best. This resulted in progressively better RMSE scores, particularly on the training data (see final model below). However, when I pushed these values too low, the RMSE for the test data suddenly shot up to 50,000.

In terms of the number of trees, I found that 50,000 trees produced very slightly better results than 10,000 but took about double the time. As such, I went with 10,000 trees.

Through all my trials, the threshold for selection stayed pretty stable at 0.0038.

The number of selected features went up considerably with hyperparameter tuning, from 17 initially to 71. As the number of features increased, the relative importance of the most important features dropped from roughly 0.4 (Overall Quality) and 0.37 (Total Indoor Square Footage) to just below 0.025 and just over 0.025 respectively.

I think that by applying a more rigorous method of hyperparameter fine tuning it may be possible to further optimize this selector.

Lasso Selector

For the LassoCV regressor, I set initial alpha values and then test them to obtain an optimal alpha.

I then take my improved model and use it with SelectFromModel. The key hyperparameter here is the threshold value, which represents the cutoff for significant features. At a high value such as 0.25, the lasso model returns zero features. To tune the threshold, I implemented a while loop that progressively lowered the value until 20 features were obtained. Ultimately the threshold dropped to 0.0329999.
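The threshold-lowering loop can be sketched as follows (synthetic data; in this toy version the target count is 5 rather than 20, and the starting threshold of 0.25 matches the one mentioned above):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Synthetic data: only the first 5 of 30 features matter
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 30))
true_coefs = np.zeros(30)
true_coefs[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ true_coefs + rng.normal(scale=0.5, size=300)

# Fit LassoCV over a grid of candidate alphas
lasso = LassoCV(alphas=np.logspace(-4, 0, 20), cv=5).fit(X, y)

# Lower the threshold until at least n_target features survive
threshold, n_target = 0.25, 5
selector = SelectFromModel(lasso, threshold=threshold, prefit=True)
while selector.transform(X).shape[1] < n_target:
    threshold -= 0.001
    selector = SelectFromModel(lasso, threshold=threshold, prefit=True)
```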

The contrast between the variables selected by the Random Forest selector and the Lasso selector is very interesting.

The Random Forest selector tended to pick mainly numeric(nonObj) variables and the most important of these tend to be somewhat similar to what we might expect based on correlations with Sale Price.

While the Lasso Selector did choose variables such as AboveGround Living Area and Overall Quality, it also chose a large number of categorical(obj) variables. Some of these were judged to be very important such as Neighborhood_Crawford, Exterior 1st_BrkFace, and Sale Condition Abnormal. I found it particularly interesting that Neighborhoods assumed such an increased importance as this group of features had been ignored by the Random Forest Selector.

I believe this points to the Random Forest selector favoring numeric variables with a high level of cardinality.

5. Preliminary Regression Models

i. Basic Multiple Linear Regression Model Using Basic Features

Originally, I had intended to include a multiple linear regression model with all 81 features. This did not work at all and resulted in a broken model. While the RMSE for the training data was fairly normal, the RMSE for the test data ballooned (15551720438.809042). The R2 for the test data was -3.845149, which indicates that the model is substantially worse than simply predicting the mean Sale Price.

As an alternative for my base linear model, I took the top variables correlated with Sale Price minus several variables that had a high level of collinearity. This left me with 16 variables for the base model.

Overall, the base model performed fairly well. The RMSE for the training data was a respectable 0.124412 and the R2 for the test data was 0.898423.
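The fit-and-score pattern can be sketched on synthetic data (the scores here are not the project's actual results):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data with a known linear signal
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# RMSE and R2 on the held-out data
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
```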

We can see that the residuals are fairly evenly spread around zero, which suggests we don't need to worry much about heteroscedasticity.

Below we can see that the model works fairly well.

ii. Multiple Linear Regression Model with Random Forest Feature Selection

The multiple regression model with the Random Forest selected variables does perform better, pushing the RMSE on the test data down to 0.1156826. Of course, it includes a far larger number of features (71). Again the residuals are tightly grouped around the mean error.

iii. Lasso Regression Model

Here I diverge a little bit from my previous method of using the Lasso selected variables in a Multiple Linear Regression Model. Instead, I implement the Lasso Regressor.

Not surprisingly, it is the best performing of my initial models, with an RMSE on the test data of 0.106162.

What I wanted to check, though, was whether it would select the same variables as the SelectFromModel implementation. It did.